Abstract — In this paper, we derive an algorithm similar to the
well-known Baum-Welch algorithm for estimating the parameters
of a hidden Markov model (HMM). The new algorithm allows the
observation PDF of each state to be defined and estimated using a
different feature set. We show that estimating parameters in this
manner is equivalent to maximizing the likelihood function for the
standard parameterization of the HMM defined on the input data
space. The processor becomes optimal if the state-dependent feature
sets are sufficient statistics to distinguish each state individually
from a common state.

Index  Terms — Baum-Welch  algorithm,  class-specific,  EM 
algorithm,  expectation-maximization,  Gaussian  mixtures,  hidden 
Markov  model  (HMM),  parameter  estimation,  sufficient  statistics. 


I.  Introduction 

THE  class-specific  method  was  recently  developed  as  a 
method of dimensionality reduction in classification [1],
[2]. Unlike other methods of dimension reduction, it is based
on  sufficient  statistics  and  results  in  no  theoretical  loss  of 
performance.  Performance  is  always  lost  going  from  theory 
to  practice  due  to  (1)  loss  of  information  when  reducing  data 
to  features,  and  (2)  approximation  of  the  theoretical  feature 
PDFs.  There  is  always  a  tradeoff  between  the  desire  to  retain 
as  much  information  as  possible  (by  increasing  the  feature 
dimension)  and  the  desire  to  obtain  better  PDF  estimates  (by 
decreasing  the  dimension).  The  class-specific  method  obtains  a 
better  compromise  by  allowing  more  information  to  be  kept  for 
a  given  maximum  feature  dimension.  It  does  this  by  assigning 
a  separate  feature  set  to  each  class.  Now  we  extend  the  idea 
further  to  the  problem  of  HMM  modeling  when  each  state  of 
the  HMM  may  have  its  own  approximate  sufficient  statistic. 

II.  Mathematical  Results 

We  show  in  this  section  that  the  class-specific  HMM  is 
merely  a  different  way  to  parameterize  the  likelihood  function 
of the conventional HMM. Let $L(X; \lambda)$ be the likelihood
function defined for the input data $X$. A special class-specific
likelihood function, $L^z(Z; \lambda^z)$, is defined using the
class-specific (state-specific) statistics $Z$. It is shown below that
maximizing $L^z(Z; \lambda^z)$ over $\lambda^z$ is equivalent to maximizing
$L(X; \lambda)$ over $\lambda$ with special constraints. While it is not
necessary for $Z$ to be sufficient for this to be true, the processor
constructed from class-specific sufficient statistics will be
optimal, provided there is no PDF estimation error.

A.  Standard  Parameterization  and  Notation 

We consider a set of state occurrences $\theta \triangleq \{q_1 \ldots q_T\}$ where
$1 \le q_t \le N$. The sequence $\theta$ is a realization of the Markov
chain with state priors $\{\pi_j,\ j = 1, 2 \ldots N\}$ and $N \times N$ state
transition matrix $A \triangleq \{a_{ij}\}$. Rather than observing the states $\theta$
directly, we observe the stochastic outputs $X \triangleq \{x_1, x_2 \ldots x_T\}$
which are realizations from a set of state PDFs

$$p_j(x) \triangleq p(x|H_j), \quad j = 1, 2 \ldots N$$

where $H_j$ is the condition that state $j$ is true. We assume the
observations are independent, thus

$$p(X|\theta) = \prod_{t=1}^{T} p_{q_t}(x_t).$$

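The complete set of parameters defining the HMM are

$$\lambda \triangleq \{\{\pi_j\}, \{a_{ij}\}, \{p_j(x)\}\}$$

where $\sum_{j=1}^{N} \pi_j = 1$ and $\sum_{j=1}^{N} a_{ij} = 1$. The likelihood function
is the joint density of the observation sequence given the model
parameters and is written (see [3, Eq. 17])

$$\begin{aligned}
L(X; \lambda) \triangleq p(X; \lambda) &= \sum_{\theta} p(X, \theta; \lambda) \\
&= \sum_{\theta} \pi_{q_1}\, p_{q_1}(x_1; \lambda) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, p_{q_t}(x_t; \lambda)
\end{aligned} \tag{1}$$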


where $\sum_{\theta}$ is a summation over all possible state sequences of
length $T$. The maximum likelihood (ML) estimate of $\lambda$ is defined as

$$\hat{\lambda} = \arg\max_{\lambda} L(X; \lambda). \tag{2}$$

We use notation similar to Rabiner [3] with the exception that
we represent state PDFs as $p_j(\cdot)$, and observations as $x_t$. In the
paper, functions beginning with the letters “b” and “p” always
denote PDFs. The letter “b” is reserved for components of mixture
PDFs and “p” is used for all other PDFs. The exception is
any function carrying the superscript “z”, which is not a PDF.
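
For very small problems, the sum over state sequences in (1) can be evaluated directly, which is useful as a reference when checking a recursive implementation. The following is a minimal sketch and not part of the original paper; the function name `likelihood_brute_force` and the numeric values are hypothetical, and the state PDFs are assumed to have been evaluated at the observations beforehand.

```python
from itertools import product


def likelihood_brute_force(pi, A, obs_pdf):
    """Evaluate (1) by summing over all N**T state sequences.

    pi[j]         : state priors
    A[i][j]       : state transition probabilities
    obs_pdf[t][j] : p_j(x_t), the state-j PDF evaluated at observation x_t
    """
    n_states, n_steps = len(pi), len(obs_pdf)
    total = 0.0
    for theta in product(range(n_states), repeat=n_steps):
        # pi_{q_1} p_{q_1}(x_1) for the first state in the sequence
        term = pi[theta[0]] * obs_pdf[0][theta[0]]
        # a_{q_{t-1} q_t} p_{q_t}(x_t) for every later time step
        for t in range(1, n_steps):
            term *= A[theta[t - 1]][theta[t]] * obs_pdf[t][theta[t]]
        total += term
    return total


# Hypothetical two-state example with three observations.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
obs_pdf = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
print(likelihood_brute_force(pi, A, obs_pdf))
```

The cost grows as $N^T$, so this is only a checking tool; the forward procedure discussed later computes the same quantity recursively.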


U.S.  Government  work  not  protected  by  U.S.  Copyright. 


Report Documentation Page (Standard Form 298, Rev. 8-98; Form Approved, OMB No. 0704-0188)

Report date: 29 AUG 2000
Title and subtitle: A Modified Baum-Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces
Performing organization: Naval Undersea Warfare Center, Newport, RI, 02841
Distribution/availability statement: Approved for public release; distribution unlimited
Security classification (report, abstract, this page): unclassified
Limitation of abstract: Same as Report (SAR)
Number of pages: 6

IEEE  TRANSACTIONS  ON  SPEECH  AND  AUDIO  PROCESSING,  VOL.  9,  NO.  4,  MAY  2001 


B.  Class-Specific  Parameterization 
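Define

$$Z \triangleq \{[z_{1,1} \ldots z_{N,1}],\ [z_{1,2} \ldots z_{N,2}],\ \ldots\ [z_{1,T} \ldots z_{N,T}]\}$$

where

$$z_{j,t} = T_j(x_t), \quad j = 1 \ldots N, \quad t = 1 \ldots T.$$

The complete class-specific parameterization is written

$$\lambda^z \triangleq \{\{\pi_j\}, \{a_{ij}\}, \{\mu^z_{jk}\}, \{U^z_{jk}\}, \{c^z_{jk}\}\}$$

where $\{\pi_j\}, \{a_{ij}\}$ are identical to the corresponding components
of $\lambda$, and have the same constraints. The state-dependent
PDFs are modeled as Gaussian mixture densities

$$p_j(z_j; \lambda^z) = \sum_{k=1}^{M} c^z_{jk}\, \mathcal{N}(z_j, \mu^z_{jk}, U^z_{jk}) \tag{3}$$

where $\sum_{k=1}^{M} c^z_{jk} = 1$ and $\mathcal{N}(z_j, \mu, U)$ are the joint Gaussian
densities

$$\mathcal{N}(z_j, \mu^z_{jk}, U^z_{jk}) = (2\pi)^{-P_j/2}\, |U^z_{jk}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (z_j - \mu^z_{jk})' (U^z_{jk})^{-1} (z_j - \mu^z_{jk}) \right\}$$

and $P_j$ is the dimension of $z_j$. The relationship between $\lambda^z$ and
$\lambda$ will be established shortly.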

1) Noise-Only Condition $H_0$: To apply the class-specific
method in its simplest form [see note following (11)], we need
to define a condition $H_0$ that is common to all states. One way
to do this is to let $H_0$ represent the “noise-only” condition. For
example, assume the PDF of $x$ in each state is dependent on a
“signal amplitude” parameter $\rho_j$. Under $H_j$, the PDF of $x$ is
marginalized over the distribution of $\rho_j$, thus

$$p(x|H_j) = \int_{\rho_j} p(x|\rho_j, H_j)\, p(\rho_j|H_j)\, d\rho_j, \quad 1 \le j \le N.$$

Let there exist a common noise-only condition $H_0$ defined by

$$p(x|H_0) = p(x|\rho_1 = 0, H_1) = p(x|\rho_2 = 0, H_2) = \cdots = p(x|\rho_N = 0, H_N).$$

We assume $x_t$ are independent under $H_0$. Thus,

$$p(X|H_0) = \prod_{t=1}^{T} p(x_t|H_0). \tag{4}$$

One further requirement is that

$$p(x|H_0) > 0, \quad \text{for all } x \in \mathcal{X} \tag{5}$$

where $\mathcal{X}$ is the allowable range of $x_t$. Note that this requirement
is met if $p(x|H_0)$ is Gaussian.

While this structure does not seem to fit many problems of
interest, any problem can be modified to include an amplitude
parameter even if the amplitude parameter is never zero in practice.
Furthermore, the noise-only condition $H_0$ can have an arbitrarily
small assumed variance because $H_0$ is only a theoretical
tool and does not need to approximate any realistic situation.
We will explain how the choice of $H_0$ affects the choice of the
state-dependent statistics.

2) Sufficiency of $Z$ and Relationship to $H_0$: We will show
that if $Z$ meets a special sufficiency requirement, the class-specific
method becomes optimum. To understand the implications
of the sufficiency of $Z$, we must consider a conventional
feature-based approach in which a common feature set replaces the
raw data. Let $\{z_t = T(x_t),\ 1 \le t \le T\}$ and define the HMM
based on the state-dependent distributions $\{p_j(z),\ 1 \le j \le N\}$.
This is the conventional HMM approach which has been
very successful [3]. An example of $z$ is a set of cepstrum-derived
features. For optimality of the resulting processor, $z$ must
be a sufficient statistic for the classification of the $N$ states. One
way to express the sufficiency requirement is through the likelihood
ratios, which are invariant when written as a function of
a sufficient statistic [4], [5]. More precisely,

$$\frac{p(x|H_j)}{p(x|H_k)} = \frac{p(z|H_j)}{p(z|H_k)}, \quad 1 \le j, k \le N, \quad j \ne k. \tag{6}$$

Clearly, $z$ must contain all information necessary to distinguish
any two states. This can be a very difficult requirement to meet
in practice because a significant amount of information can be
lost when reducing the data to features. In practice, the tradeoff
consists of the contradictory goals of making $z$ as close to sufficient
as possible (by making $z$ larger), while at the same time
making the PDF estimation problem as tractable as possible (by
making $z$ smaller).

For optimality of the class-specific method, however, we require
that $z_j = T_j(x)$ be a sufficient statistic for the binary
hypothesis test between $H_j$ and $H_0$. Specifically, if $z_j$ is sufficient,
we have

$$\frac{p(x|H_j)}{p(x|H_0)} = \frac{p(z_j|H_j)}{p(z_j|H_0)}, \quad 1 \le j \le N. \tag{7}$$

Clearly, $z_j$ must contain all information which helps distinguish
$H_j$ from $H_0$. In contrast to the conventional method which
places all information in $z$, the class-specific method distributes
the information among the class-specific statistics. For a fixed
feature set dimension, more information is allowed.
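
As an illustration of (7), consider a case where an exactly sufficient statistic is known. The sketch below is not from the paper: it assumes, purely for illustration, that $H_j$ is i.i.d. Gaussian data with known mean `mu` and unit variance, that $H_0$ is i.i.d. zero-mean unit-variance Gaussian noise, and that the feature is the sample mean. Under these assumptions the likelihood ratio computed from the raw data and the one computed from the feature should agree.

```python
import math


def gauss_pdf(v, mean, var):
    """Scalar Gaussian density."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def lr_raw(x, mu):
    """Left side of (7): p(x|H_j)/p(x|H_0) for i.i.d. N(mu, 1) versus N(0, 1)."""
    ratio = 1.0
    for v in x:
        ratio *= gauss_pdf(v, mu, 1.0) / gauss_pdf(v, 0.0, 1.0)
    return ratio


def lr_feature(x, mu):
    """Right side of (7): p(z|H_j)/p(z|H_0) with z the sample mean.

    For n i.i.d. unit-variance samples the sample mean is Gaussian with
    variance 1/n under both hypotheses.
    """
    n = len(x)
    z = sum(x) / n
    return gauss_pdf(z, mu, 1.0 / n) / gauss_pdf(z, 0.0, 1.0 / n)


x = [0.3, -1.2, 2.1, 0.8, 1.5]   # hypothetical raw data
print(lr_raw(x, 0.7), lr_feature(x, 0.7))
```

The five-dimensional data vector is replaced by a one-dimensional feature without changing the likelihood ratio against $H_0$, which is the property (7) asks of each $z_j$.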

Clearly, the selection of $H_0$ affects the selection of the feature
transformations $T_j(\cdot)$. For optimality to hold, no information
which can help distinguish $H_j$ from $H_0$ can be discarded. An
example is background noise. If $H_j$ contains background noise
inconsistent with $H_0$, then $z_j$ should contain information about
the background noise. To reduce the complexity of the feature
sets, it may be necessary to whiten and normalize the data in
such a way that the background noise resembles $H_0$. On the
other hand, it may be acceptable to discard the information and
suffer a slight performance loss. This is especially true at high
signal-to-noise ratio (SNR).

3) Class-Specific Likelihood Function: Define the class-specific
likelihood function as

$$L^z(Z; \lambda^z) \triangleq \sum_{\theta} \pi_{q_1} \left[ \frac{p_{q_1}(z_{q_1,1}; \lambda^z)}{p(z_{q_1,1}|H_0)} \right] \prod_{t=2}^{T} a_{q_{t-1} q_t} \left[ \frac{p_{q_t}(z_{q_t,t}; \lambda^z)}{p(z_{q_t,t}|H_0)} \right]. \tag{8}$$

The maximum likelihood (ML) estimate of $\lambda^z$ is defined as

$$\hat{\lambda}^z = \arg\max_{\lambda^z} L^z(Z; \lambda^z). \tag{9}$$

The objective is to derive an algorithm to solve (2) by solving
(9).

4) Relationship to the Standard Parameterization and Optimality:
It is not clear yet that (8) is related to (1); however, in
fact we can solve (2) by solving (9). To demonstrate this, we
need to convert any class-specific parameter set $\lambda^z$ into a valid
conventional parameter set $\lambda$. This requires that the PDF parameters
$[\{\mu^z_{jk}\}, \{U^z_{jk}\}, \{c^z_{jk}\}]$ can be converted into PDFs defined
on $\mathcal{X}$. For this, we need Theorems 1 and 2.

To introduce the theorems, we define a feature set $z = T(x)$.
Because $T(x)$ is many-to-one, there is no way to reconstruct the
PDF of $x$ unambiguously given the PDF of $z$. However, Theorems
1 and 2 show that given an arbitrary PDF $f_z(z)$, and arbitrary
feature transformation $T(x)$, it is possible to construct
a PDF $f_x(x)$ such that when $x$ is drawn from $f_x(x)$, the distribution
of $z$ will be $f_z(z)$. We will also mention other desirable
properties that can be attributed to $f_x(x)$.

Theorem 1: Let the PDF $p_x(x|H_0)$ be defined on $\mathcal{X}$ and let
$p_x(x|H_0) > 0$ for all $x \in \mathcal{X}$. Let the r.v. $z$ be related to $x$ by
the many-to-one feature transformation $z = T(x)$ where $T(x)$
is any measurable function of $x$. Let $\mathcal{Z}$ be the image of $\mathcal{X}$ under
transformation $T(x)$. Let $p_z(z|H_0)$ be the PDF of $z$ when $x$ is
drawn from the PDF $p_x(x|H_0)$. Thus, $p_z(z|H_0) > 0$ for all
$z \in \mathcal{Z}$. Let $f_z(z)$ be any PDF defined on $\mathcal{Z}$. Then the function
defined by

$$f_x(x) = \frac{p_x(x|H_0)}{p_z(T(x)|H_0)}\, f_z(T(x)) \tag{10}$$

is a PDF defined on $\mathcal{X}$.

Proof:

$$\begin{aligned}
\int_{x \in \mathcal{X}} \frac{p_x(x|H_0)}{p_z(T(x)|H_0)}\, f_z(T(x))\, dx
&= E_{x|H_0} \left\{ \frac{f_z(T(x))}{p_z(T(x)|H_0)} \right\} \\
&= E_{z|H_0} \left\{ \frac{f_z(z)}{p_z(z|H_0)} \right\} \\
&= \int_{z \in \mathcal{Z}} \frac{f_z(z)}{p_z(z|H_0)}\, p_z(z|H_0)\, dz \\
&= \int_{z \in \mathcal{Z}} f_z(z)\, dz \\
&= 1.
\end{aligned}$$

The equivalence of the expected values in lines one and two
is an application of the change of variables theorem [6]. For
example, let $h(z)$ be any function defined on $\mathcal{Z}$. If $z = T(x)$,
then $E_z\{h(z)\} = E_x\{h(T(x))\}$. This can be seen when the
expected values are written as the limiting form of the sample
mean of a size-$K$ sample set as $K \to \infty$.



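Theorem 2: Let $x$ be drawn from the distribution $f_x(x)$ as
defined in (10). Then if $z = T(x)$, the PDF of $z$ is $f_z(z)$.

Proof: Let $M_z(y)$ be the joint moment generating function
(MGF) of $z$. By definition,

$$\begin{aligned}
M_z(y) &= E_z\{e^{y'z}\} \\
&= E_x\{e^{y'T(x)}\} \\
&= \int_{x \in \mathcal{X}} e^{y'T(x)}\, \frac{p_x(x|H_0)}{p_z(T(x)|H_0)}\, f_z(T(x))\, dx \\
&= E_{x|H_0} \left\{ e^{y'T(x)}\, \frac{f_z(T(x))}{p_z(T(x)|H_0)} \right\} \\
&= E_{z|H_0} \left\{ e^{y'z}\, \frac{f_z(z)}{p_z(z|H_0)} \right\} \\
&= \int_{z \in \mathcal{Z}} e^{y'z}\, \frac{f_z(z)}{p_z(z|H_0)}\, p_z(z|H_0)\, dz \\
&= \int_{z \in \mathcal{Z}} e^{y'z}\, f_z(z)\, dz
\end{aligned}$$

from which we may conclude that the PDF of $z$ is $f_z(z)$.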
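
Theorems 1 and 2 can be checked exactly on a discrete analog, with probability mass functions in place of PDFs and sums in place of integrals. The following sketch is illustrative only: the six-point data space, the feature map `T`, and all probability values are hypothetical, and rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction as F


def T(x):
    """Hypothetical many-to-one feature map from {0..5} onto {0, 1, 2}."""
    return x % 3


def induced(p, domain):
    """Distribution of z = T(x) when x is drawn from the PMF p."""
    out = {}
    for x in domain:
        out[T(x)] = out.get(T(x), F(0)) + p[x]
    return out


X = range(6)
# Reference PMF p_x(x|H0): strictly positive on X and sums to one.
p_x_H0 = {0: F(1, 12), 1: F(1, 6), 2: F(1, 4), 3: F(1, 4), 4: F(1, 6), 5: F(1, 12)}
p_z_H0 = induced(p_x_H0, X)

# An arbitrary target PMF f_z on the feature space.
f_z = {0: F(1, 2), 1: F(3, 10), 2: F(1, 5)}

# Discrete form of (10): f_x(x) = p_x(x|H0) / p_z(T(x)|H0) * f_z(T(x)).
f_x = {x: p_x_H0[x] / p_z_H0[T(x)] * f_z[T(x)] for x in X}

print(sum(f_x.values()))        # Theorem 1: total mass of f_x
print(induced(f_x, X) == f_z)   # Theorem 2: feature distribution under f_x
```

The same dictionary also exhibits property 2 listed below: the ratio `f_x[x] / p_x_H0[x]` takes the same value for every `x` that maps to a given feature value.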

The PDF $f_x(x)$ has the following properties.

1) Let $H_1$ be some arbitrary hypothesis with PDF defined on
$\mathcal{X}$. Then, when $T(x)$ is a sufficient statistic for the binary
test of $H_1$ versus $H_0$, then as $f_z(z) \to p(z|H_1)$, we have

$$f_x(x) \to p(x|H_1).$$

2) Let $z^*$ be a point in $\mathcal{Z}$. Then

$$\frac{f_x(x)}{p_x(x|H_0)} = \frac{f_z(z^*)}{p_z(z^*|H_0)}$$

for all $x$ such that $T(x) = z^*$.
Thus, $f_x(x)$ has the property that all points $x$ such that
$T(x) = z^*$ are equally distinguishable from $H_0$.

3) Although Theorems 1 and 2 do not impose any sufficiency
requirements on $z$, it results that $z$ are sufficient
statistics for the constructed PDF. More precisely, $z$ is an
exact sufficient statistic for the binary hypothesis test of
$f_x(x)$ versus $p_x(x|H_0)$.

We now show that we can solve (2) by solving (9). Suppose
that given $\lambda^z$, one constructed a standard parameterization
$\lambda^z \to \lambda$, written $\lambda = G(\lambda^z)$, by constructing the PDFs

$$p_j(x; G(\lambda^z)) \triangleq \left[ \frac{p(x|H_0)}{p(T_j(x)|H_0)} \right] p_j(T_j(x); \lambda^z), \quad \text{for } 1 \le j \le N. \tag{11}$$

Note that, in general, the reference hypothesis can be a function
of $j$, written $H_{0,j}$. For simplicity, we have chosen to use a
common reference $H_{0,j} = H_0$. That $p_j(x; G(\lambda^z))$ are indeed
PDFs can be seen from Theorem 1. Furthermore, from Theorem
2, it may be seen that these densities are such that they induce
the densities $p_j(z_j; \lambda^z)$ on $z_j$. Next, from (1), (4), (8), and (11),
we see that

$$L^z(Z; \lambda^z) = \frac{L(X; G(\lambda^z))}{p(X|H_0)}. \tag{12}$$

Because $p(X|H_0)$ in (12) does not depend on $\lambda^z$, the maximizer
$\hat{\lambda}^z$ of (9) satisfies

$$\hat{\lambda}^z = \arg\max_{\lambda^z} L(X; G(\lambda^z)).$$

Therefore, if we define $\hat{\lambda}^g \triangleq G(\hat{\lambda}^z)$, we can claim that
$\hat{\lambda}^g$ maximizes $L(X; \lambda)$ over all $\lambda$ which
satisfy $\lambda = G(\lambda^z)$ for some $\lambda^z$. Furthermore, when the
class-specific statistics are sufficient, (7) holds and it follows from
(7), (11) that if $p_j(z_j; \lambda^z) \to p(z_j|H_j)$, then $p_j(x; G(\lambda^z)) \to p(x|H_j)$.
Thus, one is able to construct the true HMM parameters
from the class-specific parameter estimates. Furthermore,

$$L^z(Z; \lambda^z) \to \frac{L(X; \lambda)}{p(X|H_0)}$$

and the class-specific classifier becomes the optimal
Neyman-Pearson classifier for comparing competing HMM
hypotheses.


C. Class-Specific Baum-Welch Algorithm

An iterative algorithm for solving (2) based on the EM
method, and due to Baum [7], is available. Formulas for
updating the parameters $\lambda$ at each iteration are called the
reestimation formulas [3]. The derivation by Juang [8] is well
known for the case when $p_j(x)$ are Gaussian mixtures. We need
to modify the derivation of Juang to solve (9). We may write


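$$L^z(Z; \lambda^z) = \sum_{\theta} \pi_{q_1} \left[ \sum_{k=1}^{M} c^z_{q_1 k}\, b^z_{q_1 k}(z_{q_1,1}; \lambda^z) \right] \prod_{t=2}^{T} a_{q_{t-1} q_t} \left[ \sum_{k=1}^{M} c^z_{q_t k}\, b^z_{q_t k}(z_{q_t,t}; \lambda^z) \right] \tag{13}$$

where

$$b^z_{jk}(z_j; \lambda^z) \triangleq \frac{\mathcal{N}(z_j, \mu^z_{jk}, U^z_{jk})}{p(z_j|H_0)}. \tag{14}$$

This may then be rewritten as (see Juang [8, Eqs. 8-11])

$$L^z(Z; \lambda^z) = \sum_{\theta} \sum_{K} p^z(Z, \theta, K; \lambda^z) \tag{15}$$

where $\sum_{K} = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M}$ and

$$p^z(Z, \theta, K; \lambda^z) = \pi_{q_1}\, b^z_{q_1 k_1}(z_{q_1,1}; \lambda^z)\, c^z_{q_1 k_1} \times \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b^z_{q_t k_t}(z_{q_t,t}; \lambda^z)\, c^z_{q_t k_t}. \tag{16}$$

We wish to maximize the function $L^z(Z; \lambda^z)$ over $\lambda^z$. To this
end, we seek an algorithm that, given a parameter value $\lambda^z$,
can always find a new $\lambda^{z\prime}$ such that $L^z(Z; \lambda^{z\prime}) \ge L^z(Z; \lambda^z)$.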


1) Auxiliary Function:

Theorem 3: Define

$$Q(\lambda^z, \lambda^{z\prime}) \triangleq \sum_{\theta} \sum_{K} p^z(Z, \theta, K; \lambda^z) \log p^z(Z, \theta, K; \lambda^{z\prime}). \tag{17}$$

If $Q(\lambda^z, \lambda^{z\prime}) \ge Q(\lambda^z, \lambda^z)$, then $L^z(Z; \lambda^{z\prime}) \ge L^z(Z; \lambda^z)$.

Proof: $\log x$ is strictly concave for $x > 0$. Hence, see (18),
which is displayed after (20).

The inequality is strict unless $p^z(Z, \theta, K; \lambda^{z\prime}) = p^z(Z, \theta, K; \lambda^z)$.
Note that this proof differs in no meaningful
way from Baum’s [7] or Juang’s [8]. One important
difference is that $p^z(Z, \theta, K; \lambda^z)$ is not a PDF. But the proof
relies on Jensen’s inequality which is based on expected values
using the probability measure $p^z(Z, \theta, K; \lambda^z)/L^z(Z; \lambda^z)$,
which is a discrete PDF due to (15).

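2) Reestimation Algorithm: The problem now is to solve for

$$\max_{\lambda^{z\prime}} Q(\lambda^z, \lambda^{z\prime}). \tag{19}$$

We have

$$\begin{aligned}
Q(\lambda^z, \lambda^{z\prime}) &= \sum_{\theta} \sum_{K} p^z(Z, \theta, K; \lambda^z) \log p^z(Z, \theta, K; \lambda^{z\prime}) \\
&= \sum_{\theta} \sum_{K} \left[ \pi_{q_1}\, b^z_{q_1 k_1}(z_{q_1,1})\, c^z_{q_1 k_1} \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b^z_{q_t k_t}(z_{q_t,t})\, c^z_{q_t k_t} \right] \\
&\quad \times \left[ \log \pi'_{q_1} + \sum_{t=2}^{T} \log a'_{q_{t-1} q_t} + \sum_{t=1}^{T} \log b^{z\prime}_{q_t k_t}(z_{q_t,t}) + \sum_{t=1}^{T} \log c^{z\prime}_{q_t k_t} \right].
\end{aligned} \tag{20}$$
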

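We then may follow the proof of Juang, provided the necessary
requirements of $b^z_{jk}(z_j; \lambda^z)$ are met. Notice that $b^z_{jk}(z_j; \lambda^z)$
depends on $\lambda^z$ only through the multivariate Gaussian density; the
data $z_j$ may be considered fixed. Thus, $b^z_{jk}(z_j; \lambda^z)$ meets the
log-concavity and elliptical symmetry requirements
necessary for the reestimation formulas that follow. We can proceed
in the proof of Juang, until it is necessary to differentiate
$b^z_{jk}(z_j; \lambda^z)$ with respect to $\lambda^z$. At that point, the additional terms
in (14), not dependent on $\lambda^z$, do not affect the solution of the
maximization; we only need to let $z_{j,t}$ take the place of $x_t$. The
resulting algorithm is provided below.

Equation (18), referenced in the proof of Theorem 3:

$$\begin{aligned}
\log \left( \frac{L^z(Z; \lambda^{z\prime})}{L^z(Z; \lambda^z)} \right)
&= \log \left( \sum_{\theta} \sum_{K} \frac{p^z(Z, \theta, K; \lambda^{z\prime})}{L^z(Z; \lambda^z)} \right) \\
&= \log \left( \sum_{\theta} \sum_{K} \frac{p^z(Z, \theta, K; \lambda^z)}{L^z(Z; \lambda^z)} \cdot \frac{p^z(Z, \theta, K; \lambda^{z\prime})}{p^z(Z, \theta, K; \lambda^z)} \right) \\
&\ge \sum_{\theta} \sum_{K} \frac{p^z(Z, \theta, K; \lambda^z)}{L^z(Z; \lambda^z)} \log \left( \frac{p^z(Z, \theta, K; \lambda^{z\prime})}{p^z(Z, \theta, K; \lambda^z)} \right) \\
&= [L^z(Z; \lambda^z)]^{-1} \left[ Q(\lambda^z, \lambda^{z\prime}) - Q(\lambda^z, \lambda^z) \right] \ge 0
\end{aligned} \tag{18}$$
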

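3) Class-Specific Forward Procedure: The joint probability
(1) may be calculated with the forward procedure [3], [8] by
recursively computing the quantities

$$\alpha_t(i) = p(x_1 \ldots x_t, q_t = i; \lambda).$$

Similarly, the class-specific likelihood function (8) may be calculated
by recursively computing the quantities

$$\alpha^c_t(i) \triangleq \frac{p(x_1 \ldots x_t, q_t = i; \lambda)}{p(x_1 \ldots x_t|H_0)}$$

evaluated at $\lambda = G(\lambda^z)$.

1) Initialization:

$$\alpha^c_1(i) = \pi_i\, \frac{p_i(z_{i,1}; \lambda^z)}{p(z_{i,1}|H_0)}, \quad 1 \le i \le N. \tag{21}$$

2) Induction:

$$\alpha^c_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha^c_t(i)\, a_{ij} \right] \frac{p_j(z_{j,t+1}; \lambda^z)}{p(z_{j,t+1}|H_0)}, \quad 1 \le t \le T - 1, \quad 1 \le j \le N. \tag{22}$$

3) Termination:

$$\frac{L(X; G(\lambda^z))}{p(X|H_0)} = \sum_{i=1}^{N} \alpha^c_T(i). \tag{23}$$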

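4) Class-Specific Backward Procedure: The backward parameters
$\beta^c_t(i)$ are similarly defined.

1) Initialization:

$$\beta^c_T(i) = 1. \tag{24}$$

2) Induction:

$$\beta^c_t(i) = \sum_{j=1}^{N} a_{ij}\, \frac{p_j(z_{j,t+1}; \lambda^z)}{p(z_{j,t+1}|H_0)}\, \beta^c_{t+1}(j), \quad t = T-1, T-2, \ldots 1, \quad 1 \le i \le N. \tag{25}$$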


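5) HMM Reestimation Formulas: Define $\gamma_t(j)$ as $p(q_t = j|X)$.
We have

$$\gamma_t(j) = \frac{\alpha^c_t(j)\, \beta^c_t(j)}{\sum_{i=1}^{N} \alpha^c_t(i)\, \beta^c_t(i)}. \tag{26}$$

Let $\xi_t(i, j)$ be defined by (27), which is printed with the example
in Section III-A. The updated state priors are

$$\hat{\pi}_i = \gamma_1(i). \tag{28}$$

The updated state transition matrix is

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}. \tag{29}$$

Keep in mind that $\gamma_t(j)$, $\xi_t(i, j)$, $\hat{\pi}_j$, and $\hat{a}_{ij}$ will be identical to
those estimated by the conventional approach if (7) is true or if
$\lambda = G(\lambda^z)$.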
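
The backward recursion and the reestimation formulas (26)-(29) can be sketched as follows. The code assumes the ratios $p_j(z_{j,t}; \lambda^z)/p(z_{j,t}|H_0)$ have already been evaluated and stored in a table `r[t][j]`; that table, the function names, and the numeric values are hypothetical and serve only to illustrate the recursions.

```python
def forward_backward(pi, A, r):
    """Class-specific forward and backward variables, (21)-(22) and (24)-(25).

    r[t][j] holds the ratio p_j(z_{j,t}) / p(z_{j,t}|H0) for state j at time t.
    """
    T, N = len(r), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]          # (24): last row stays at 1
    for i in range(N):
        alpha[0][i] = pi[i] * r[0][i]             # (21)
    for t in range(T - 1):
        for j in range(N):
            alpha[t + 1][j] = sum(alpha[t][i] * A[i][j] for i in range(N)) * r[t + 1][j]  # (22)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * r[t + 1][j] * beta[t + 1][j] for j in range(N))    # (25)
    return alpha, beta


def reestimate(pi, A, r):
    """State posteriors and updated priors and transitions, (26)-(29)."""
    alpha, beta = forward_backward(pi, A, r)
    T, N = len(r), len(pi)
    gamma = []
    for t in range(T):
        w = [alpha[t][j] * beta[t][j] for j in range(N)]
        s = sum(w)
        gamma.append([v / s for v in w])                       # (26)
    xi = []
    for t in range(T - 1):
        w = [[alpha[t][i] * A[i][j] * r[t + 1][j] * beta[t + 1][j]
              for j in range(N)] for i in range(N)]
        s = sum(map(sum, w))
        xi.append([[v / s for v in row] for row in w])         # (27)
    new_pi = gamma[0][:]                                       # (28)
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]            # (29)
    return gamma, xi, new_pi, new_A


# Hypothetical two-state example with three time steps.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
r = [[2.4, 0.3], [0.6, 1.75], [1.2, 0.9]]
gamma, xi, new_pi, new_A = reestimate(pi, A, r)
print(new_pi, new_A)
```

Only ratios against $H_0$ enter the computation, so the raw-data PDFs never need to be evaluated. As with the forward sketch, no scaling is applied, which limits this version to short sequences.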

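6) Gaussian Mixture Reestimation Formulas: Let

$$\gamma_t(j, m) = \gamma_t(j) \left[ \frac{c^z_{jm}\, \mathcal{N}(z_{j,t}, \mu^z_{jm}, U^z_{jm})}{p_j(z_{j,t}; \lambda^z)} \right] \tag{30}$$

and

$$\hat{c}^z_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m)}{\sum_{t=1}^{T} \sum_{l=1}^{M} \gamma_t(j, l)}, \tag{31}$$

$$\hat{\mu}^z_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m)\, z_{j,t}}{\sum_{t=1}^{T} \gamma_t(j, m)}, \tag{32}$$

$$\hat{U}^z_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m)\, (z_{j,t} - \mu^z_{jm})(z_{j,t} - \mu^z_{jm})'}{\sum_{t=1}^{T} \gamma_t(j, m)}. \tag{33}$$
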
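
For a single state $j$ with a scalar feature ($P_j = 1$), the updates (30)-(33) reduce to a few weighted averages. The sketch below is illustrative only: the function name, the restriction to scalar features, and the numeric values are assumptions, and the variance update plugs in the updated mean, which is a common EM convention rather than something the text above specifies.

```python
import math


def normal_pdf(z, mu, var):
    """Scalar Gaussian density, the P_j = 1 case of the density in (3)."""
    return math.exp(-0.5 * (z - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def reestimate_mixture(z, gamma_j, c, mu, var):
    """One update of the mixture parameters of a single state j.

    z[t]       : scalar feature z_{j,t}
    gamma_j[t] : state posterior gamma_t(j) from (26)
    c, mu, var : current mixture weights, means, and variances
    """
    T, M = len(z), len(c)
    g = []
    for t in range(T):
        comp = [c[m] * normal_pdf(z[t], mu[m], var[m]) for m in range(M)]
        p = sum(comp)                                    # p_j(z_{j,t}; lambda^z), (3)
        g.append([gamma_j[t] * v / p for v in comp])     # (30)
    tot = [sum(g[t][m] for t in range(T)) for m in range(M)]
    new_c = [tot[m] / sum(tot) for m in range(M)]                                    # (31)
    new_mu = [sum(g[t][m] * z[t] for t in range(T)) / tot[m] for m in range(M)]      # (32)
    # Assumption: the updated mean is used in the variance update (33).
    new_var = [sum(g[t][m] * (z[t] - new_mu[m]) ** 2 for t in range(T)) / tot[m]
               for m in range(M)]
    return new_c, new_mu, new_var


# Hypothetical data for one state with a two-component mixture.
z = [0.1, -0.4, 2.2, 1.9, 0.3]
gamma_j = [0.9, 0.8, 0.7, 0.6, 0.9]
print(reestimate_mixture(z, gamma_j, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))
```

The denominator $p(z_{j,t}|H_0)$ does not appear because it is common to the numerator and denominator of the bracketed ratio in (30) when that ratio is written in terms of $b^z_{jm}$.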


III.  Applying  the  Method 

Since truly sufficient statistics can never be found in practice,
the practitioner must be satisfied with approximate sufficiency.
Partial sufficiency of the features poses no theoretical problems
because the class-specific Baum-Welch algorithm maximizes
the true likelihood function without requiring sufficiency,
albeit subject to the constraint that the state PDFs are of the form
(11). As theory guides the practice, the sufficiency of the form
(6) which is required for the optimality of the standard approach
tells practitioners to look for features which discriminate between
the states. In contrast, the sufficiency (7) required for the
optimality of the class-specific approach tells practitioners to
look for features which discriminate states from $H_0$ and whose
exact PDF can be derived under $H_0$.

Approximations of $\{p(z_j|H_0)\}$ may also be used as long
as these approximations are valid in the tails. Tail behavior is
important because as samples diverge from $H_0$, the denominators
in (8) approach zero. Approximations with accurate tail behavior
are available for a wide range of important feature sets
in signal processing including autocorrelation and cepstrum
estimates [9].

A.  Example 

The following conceptual example illustrates the selection
of class-specific features. Consider a Markov process in which
there are three states characterized by

•  $H_1$: a low-order autoregressive process (such as a whistle)
of unknown variance;

•  $H_2$: a pure tone of unknown frequency, amplitude, and
phase in additive Gaussian noise of unknown variance;

•  $H_3$: a positive-valued impulse of duration 1 sample in
additive Gaussian noise of unknown variance.

Let $H_0$ be independent zero-mean Gaussian noise of variance 1.
At each time step $t$, a length-$K$ time-series $x_t = [x_{t,1} \ldots x_{t,K}]$
is generated according to the state in effect.

Equation (27), referenced in Section II-C5:

$$\xi_t(i, j) = \frac{\alpha^c_t(i)\, a_{ij}\, \left( p_j(z_{j,t+1}; \lambda^z) / p(z_{j,t+1}|H_0) \right) \beta^c_{t+1}(j)}{\sum_{i=1}^{N} \sum_{m=1}^{N} \alpha^c_t(i)\, a_{im}\, \left( p_m(z_{m,t+1}; \lambda^z) / p(z_{m,t+1}|H_0) \right) \beta^c_{t+1}(m)} \tag{27}$$

1) Feature Selection: Desirable features are those that are
approximately sufficient to distinguish the given hypothesis
from $H_0$ and have a distribution known under $H_0$. Consider the
following feature sets.

•  $H_1$: We use a second-order set of autocorrelation estimates

$$z_1 = [\hat{r}_0, \hat{r}_1, \hat{r}_2].$$

•  $H_2$: Let $\{X_i\}$, $1 \le i \le K$, be the length-$K$ FFT of $x_t$.
We use the index and squared value of the largest FFT bin
($a^2_{\max} = \max_i |X_i|^2$), and the average power

$$z_2 = [i_{\max}, a^2_{\max}, \hat{r}_0].$$

•  $H_3$: We use the time index and value of the largest input
sample ($x_{\max} = \max_i x_{t,i}$), and the average power

$$z_3 = [i_{\max}, x_{\max}, \hat{r}_0].$$

These feature sets are approximately sufficient to discriminate
the corresponding state from $H_0$, while being low in dimension.
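
The three feature sets can be computed from a single time series along the following lines. This is a sketch under several assumptions that the text does not fix: a biased sample autocorrelation is used for $\hat{r}_k$, the peak search over FFT bins is restricted to positive-frequency bins (excluding DC and the Nyquist bin), indices are zero-based, and the function names are hypothetical. A direct DFT is used to keep the example free of external libraries.

```python
import cmath
import math


def autocorr(x, lag):
    """Biased sample autocorrelation at the given lag (assumed estimator)."""
    K = len(x)
    return sum(x[i] * x[i + lag] for i in range(K - lag)) / K


def class_specific_features(x):
    """Return (z1, z2, z3) for one length-K time series x."""
    K = len(x)
    r = [autocorr(x, k) for k in range(3)]
    z1 = r                                              # [r0, r1, r2]

    # Squared magnitudes of the DFT bins, computed directly.
    power = [abs(sum(x[n] * cmath.exp(-2j * math.pi * i * n / K)
                     for n in range(K))) ** 2 for i in range(K)]
    # Assumption: search positive-frequency bins only.
    i_max = max(range(1, K // 2), key=lambda i: power[i])
    z2 = [i_max, power[i_max], r[0]]                    # [index, peak power, r0]

    n_max = max(range(K), key=lambda n: x[n])
    z3 = [n_max, x[n_max], r[0]]                        # [time index, peak value, r0]
    return z1, z2, z3


# Hypothetical noiseless tone at bin 3 of a length-16 series.
K = 16
tone = [math.cos(2.0 * math.pi * 3 * n / K) for n in range(K)]
z1, z2, z3 = class_specific_features(tone)
print(z1, z2, z3)
```

Each feature vector has only three elements, whereas a single feature set that serves all three states would need to carry the information of all of them at once.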

2) Feature Dependence on $H_0$: Notice that $\hat{r}_0$ is included in
each feature set because the variance for each state is unknown,
while it is fixed under $H_0$. Thus, the variance estimate has information
that can discriminate against $H_0$. If $H_0$ had an unknown
variance, $\hat{r}_0$ itself would be irrelevant in distinguishing the input
data from $H_0$ and could be discarded; however, it would be necessary
to first normalize the other features.

3) Obtaining Exact PDF under $H_0$: For each of the feature sets
shown above, the exact joint PDF of the statistics can be derived
under the $H_0$ assumption.

•  $H_1$: For $z_1$, it is necessary to use a specific autocorrelation
function (ACF) estimator whose distribution is known.
The PDF of the FFT-method ACF estimates is known exactly
[10], [11] and approximations are available with accurate
tail behavior for other ACF estimators [9].

•  $H_2$: The FFT bins are Gaussian, independent, and identically
distributed under $H_0$. However, notice that $\hat{r}_0$ is not
statistically independent of $a^2_{\max}$. The statistic

$$\hat{r}'_0 = \frac{1}{K - 1} \sum_{i \ne i_{\max}} |X_i|^2,$$

however, is independent of $a^2_{\max}$ when $i_{\max}$ is specified.
Thus, if

$$z'_2 = [i_{\max}, a^2_{\max}, \hat{r}'_0]$$

we have

$$p(z'_2|H_0) = p(a^2_{\max}|i_{\max}, H_0)\, p(\hat{r}'_0|i_{\max}, H_0)\, p(i_{\max}|H_0). \tag{34}$$

Each term in (34) has a known PDF. Notice that except
for a scale factor, the first term in (34) is distributed $\chi^2(2)$
(exponential), and the second is $\chi^2(2K - 2)$, and the last
term is a uniform distribution. It is then possible to obtain
$\hat{r}_0$ from the pair $(\hat{r}'_0, a^2_{\max})$. The PDF of $z_2$ is then easily
found by a change of variables.

•  $H_3$: We obtain $p(z_3|H_0)$ in essentially the same manner
as $p(z_2|H_0)$; however, the time index takes the place
of the frequency index and $x_{\max}$ is Gaussian.

IV.  Conclusion 

In this paper, we have demonstrated that it is possible to
parameterize an HMM using different features for each state. This
parameterization requires that the exact densities of the state-dependent
feature sets be known for some fixed “common” hypothesis
$H_0$ and that these densities are nonzero for the allowable
range of the random variables. The method can lead to
an optimal classifier if these feature sets are sufficient statistics
for discrimination of the corresponding state from the common
state $H_0$. In practice, this means that more information can be
extracted from the raw data for a given maximum PDF dimension.
In principle, the reference hypothesis does not need to be
common and can be a function of the state; however, we have
not explored this possibility in this paper.